{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c2907fddfeac343a",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# Semantic Search with Semantic Text\n",
    "\n",
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/09-semantic-text.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\n",
    "Learn how to use the [semantic_text](https://www.elastic.co/guide/en/elasticsearch/reference/current/semantic-text.html) field type to quickly get started with semantic search."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3db37d2cf8264468",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Requirements\n",
    "\n",
    "For this example, you will need:\n",
    "\n",
    "- An Elastic deployment:\n",
    "  - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook))\n",
    "\n",
    "- Elasticsearch 9.1 or above, or [Elasticsearch serverless](https://www.elastic.co/elasticsearch/serverless)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7fe1ed0703a8d1d3",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Create Elastic Cloud deployment\n",
    "\n",
    "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9c8bd62c8241f90",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Install packages and connect with Elasticsearch Client\n",
    "\n",
    "To get started, we'll need to connect to our Elastic deployment using the Python client (version 8.15.0 or above).\n",
    "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n",
    "\n",
    "First we need to `pip` install the following packages:\n",
    "\n",
    "- `elasticsearch`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13fdf7656ced2da3",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "!pip install elasticsearch"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d54b112361d2f3d",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Next, we need to import the modules we need. \n",
    "\n",
    "🔐 NOTE: getpass enables us to securely prompt the user for credentials without echoing them to the terminal, or storing it in memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a60627704e77ff6",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "from elasticsearch import Elasticsearch, exceptions\n",
    "from urllib.request import urlopen\n",
    "from getpass import getpass\n",
    "import json\n",
    "import time"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb9498124146d8bb",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Now we can instantiate the Python Elasticsearch client.\n",
    "\n",
    "First we prompt the user for their password and Cloud ID.\n",
    "Then we create a `client` object that instantiates an instance of the `Elasticsearch` class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e14437dcce0f235",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
    "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
    "\n",
    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
    "ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n",
    "\n",
    "# Create the client instance\n",
    "client = Elasticsearch(\n",
    "    # For local development\n",
    "    # hosts=[\"http://localhost:9200\"]\n",
    "    cloud_id=ELASTIC_CLOUD_ID,\n",
    "    api_key=ELASTIC_API_KEY,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89b6b7721f6d8599",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "### Enable Telemetry\n",
    "\n",
    "Knowing that you are using this notebook helps us decide where to invest our efforts to improve our products. We would like to ask you that you run the following code to let us gather anonymous usage statistics. See [telemetry.py](https://github.com/elastic/elasticsearch-labs/blob/main/telemetry/telemetry.py) for details. Thank you!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a7af618fb61f358",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "!curl -O -s https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/telemetry/telemetry.py\n",
    "from telemetry import enable_telemetry\n",
    "\n",
    "client = enable_telemetry(client, \"09-semantic-text\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cbbdaf9118a97732",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "### Test the Client\n",
    "Before you continue, confirm that the client has connected with this test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4cb0685fae12e034",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "print(client.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59e2223bf2c4331",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Refer to [the documentation](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new) to learn how to connect to a self-managed deployment.\n",
    "\n",
    "Read [this page](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new) to learn how to connect using API keys."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "818f7a72a83b5776",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Create the Index\n",
    "\n",
    "Now we need to create an index with a `semantic_text` field. Let's create one that enables us to perform semantic search on movie plots.\n",
    "We use the preconfigured `.elser-2-elastic` inference endpoint which uses the ELSER model via the [Elastic Inference Service](https://www.elastic.co/docs/explore-analyze/elastic-inference/eis)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ace87760606f67c6",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "client.indices.delete(index=\"semantic-text-movies\", ignore_unavailable=True)\n",
    "client.indices.create(\n",
    "    index=\"semantic-text-movies\",\n",
    "    mappings={\n",
    "        \"properties\": {\n",
    "            \"title\": {\"type\": \"text\"},\n",
    "            \"genre\": {\"type\": \"text\"},\n",
    "            \"plot\": {\"type\": \"text\", \"copy_to\": \"plot_semantic\"},\n",
    "            \"plot_semantic\": {\n",
    "                \"type\": \"semantic_text\",\n",
    "                \"inference_id\": \".elser-v2-elastic\",\n",
    "            },\n",
    "        }\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abc3ee7a1fddfa9b",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Notice how we configured the mappings. We defined `plot_semantic` as a `semantic_text` field.\n",
    "The `inference_id` parameter defines the inference endpoint that is used to generate the embeddings for the field.\n",
    "Then we configured the `plot` field to [copy its value](https://www.elastic.co/guide/en/elasticsearch/reference/current/copy-to.html) to the `plot_semantic` field.\n",
    "\n",
    "While `copy_to` is not required to use `semantic_text`, it enables use cases like hybrid search where semantic and lexical techniques are used together. We will cover a hybrid search example later in this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b5a46b60660a489",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Populate the Index\n",
    "\n",
    "Let's populate the index with our example dataset of 12 movies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24f0133923553d28",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/movies.json\"\n",
    "response = urlopen(url)\n",
    "movies = json.loads(response.read())\n",
    "\n",
    "operations = []\n",
    "for movie in movies:\n",
    "    operations.append({\"index\": {\"_index\": \"semantic-text-movies\"}})\n",
    "    operations.append(movie)\n",
    "client.bulk(index=\"semantic-text-movies\", operations=operations, refresh=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fff5932fcbac1b0",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Semantic Search\n",
    "\n",
    "Now that our index is populated, we can query it using semantic search.\n",
    "\n",
    "### Aside: Pretty printing Elasticsearch search results\n",
    "\n",
    "Your `search` API calls will return hard-to-read nested JSON.\n",
    "We'll create a little function called `pretty_search_response` to return nice, human-readable outputs from our examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ad417b4b3f50c889",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "def pretty_search_response(response):\n",
    "    if len(response[\"hits\"][\"hits\"]) == 0:\n",
    "        print(\"Your search returned no results.\")\n",
    "    else:\n",
    "        for hit in response[\"hits\"][\"hits\"]:\n",
    "            id = hit[\"_id\"]\n",
    "            score = hit[\"_score\"]\n",
    "            title = hit[\"_source\"][\"title\"]\n",
    "            runtime = hit[\"_source\"][\"runtime\"]\n",
    "            plot = hit[\"_source\"][\"plot\"]\n",
    "            keyScene = hit[\"_source\"][\"keyScene\"]\n",
    "            genre = hit[\"_source\"][\"genre\"]\n",
    "            released = hit[\"_source\"][\"released\"]\n",
    "\n",
    "            pretty_output = f\"\\nID: {id}\\nScore: {score}\\nTitle: {title}\\nRuntime: {runtime}\\nPlot: {plot}\\nKey Scene: {keyScene}\\nGenre: {genre}\\nReleased: {released}\"\n",
    "\n",
    "            print(pretty_output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22c4d4d395adb472",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "### Semantic Search with the `semantic` Query\n",
    "\n",
    "We can use the [`semantic` query](https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-semantic-query.html) to quickly & easily query the `semantic_text` field in our index.\n",
    "Under the hood, an embedding is automatically generated for our query text using the `semantic_text` field's inference endpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a8520ffc8a3efb3",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "response = client.search(\n",
    "    index=\"semantic-text-movies\",\n",
    "    query={\"semantic\": {\"field\": \"plot_semantic\", \"query\": \"organized crime movies\"}},\n",
    ")\n",
    "\n",
    "pretty_search_response(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "148fda24a3964aa9",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "These results demonstrate the power of semantic search.\n",
    "Our top results are all movies involving organized crime, even if the exact term \"organized crime\" doesn't appear in the plot description.\n",
    "This works because the ELSER model understands the semantic similarity between terms like \"organized crime\" and \"mob\".\n",
    "\n",
    "However, these results also show the weaknesses of semantic search.\n",
    "Because semantic search is based on vector similarity, there is a long tail of results that are weakly related to our query vector.\n",
    "That's why movies like _The Matrix_ are returned towards the tail end of our search results."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c9bab225a745746",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "### Hybrid Search with the `semantic` Query\n",
    "\n",
    "We can address some of the issues with pure semantic search by combining it with lexical search techniques.\n",
    "Here, we use a [boolean query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html) to require that all matches contain at least term from the query text, in either the `plot` or `genre` fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f72f7906b918dc1",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "response = client.search(\n",
    "    index=\"semantic-text-movies\",\n",
    "    query={\n",
    "        \"bool\": {\n",
    "            \"must\": {\n",
    "                \"multi_match\": {\n",
    "                    \"fields\": [\"plot\", \"genre\"],\n",
    "                    \"query\": \"organized crime movies\",\n",
    "                    \"boost\": 1.5,\n",
    "                }\n",
    "            },\n",
    "            \"should\": {\n",
    "                \"semantic\": {\n",
    "                    \"field\": \"plot_semantic\",\n",
    "                    \"query\": \"organized crime movies\",\n",
    "                    \"boost\": 3.0,\n",
    "                }\n",
    "            },\n",
    "        }\n",
    "    },\n",
    ")\n",
    "\n",
    "pretty_search_response(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d50d10ced4389107",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "These results demonstrate that the application of lexical search techniques can help focus the results, while retaining many of the advantages of semantic search.\n",
    "In this example, the top search results are all still movies involving organized crime, but the `multi_match` query keeps the long tail shorter and focused on movies in the crime genre.\n",
    "\n",
    "The `copy_to` parameter we defined in the mapping enables this query pattern. It ensures that the content provided for the `plot` field is indexed both lexically and semantically.\n",
    "\n",
    "Note the `boost` parameters applied to the `multi_match` and `semantic` queries.\n",
    "Combining lexical and semantic search techniques in a boolean query like this is called \"linear combination\" and when doing this, it is important to normalize the scores of the component queries.\n",
    "This involves consideration of a few factors, including:\n",
    "\n",
    "- The range of scores generated by the query\n",
    "- The relative importance and accuracy of the query in the context of the dataset\n",
    "\n",
    "In this example, the `multi_match` query is mostly used as a filter to constrain the search results' long tail, so we assign it a lower boost than the `semantic` query."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78be304240d6c695",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Conclusion\n",
    "\n",
    "The [semantic_text](https://www.elastic.co/guide/en/elasticsearch/reference/current/semantic-text.html) field type is a powerful tool that can help you quickly and easily integrate semantic search.\n",
    "It can greatly improve the relevancy of your search results, particularly when combined with lexical search techniques."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
